Case Study: Thera Bank | Credit Card Users Churn Prediction

*Model Tuning Techniques*


Problem Statement:

Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving the credit card service would lead to a loss for the bank, so the bank wants to analyze its customer data, identify the customers who are likely to leave the credit card service, and understand the reasons why, so that it can improve in those areas.

As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

You need to identify the best possible model that will give the required performance

Objective:

  1. Explore and visualize the dataset.
  2. Build a classification model to predict whether a customer is going to churn.
  3. Optimize the model using appropriate techniques.
  4. Generate a set of insights and recommendations that will help the bank.

Data Description:


Importing necessary libraries

Load and overview the dataset
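A minimal loading sketch. The CSV file name is an assumption, and a tiny inline sample stands in for the real file here:

```python
import io
import pandas as pd

# In the notebook this would be pd.read_csv("BankChurners.csv")
# (file name assumed); a small inline sample stands in for the file.
csv_sample = io.StringIO(
    "CLIENTNUM,Attrition_Flag,Customer_Age,Gender\n"
    "768805383,Existing Customer,45,M\n"
    "818770008,Attrited Customer,49,F\n"
)
data = pd.read_csv(csv_sample)

print(data.shape)   # (rows, columns)
print(data.head())  # first few rows
data.info()         # dtypes and non-null counts
```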


Summary of the dataset

Let's check the count of each unique category in each of the categorical variables.
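One way to do this, assuming the dataset is loaded in a DataFrame named `data` (a tiny stand-in frame is used here):

```python
import pandas as pd

# Stand-in frame; in the notebook `data` holds the full dataset.
data = pd.DataFrame({
    "Gender": ["M", "F", "F", "M", "F"],
    "Card_Category": ["Blue", "Blue", "Silver", "Blue", "Gold"],
})

# Count of each unique category in every object-typed column
for col in data.select_dtypes(include="object").columns:
    print(data[col].value_counts(), end="\n\n")
```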

EDA

Univariate Analysis

Observations on Customer_Age

Observations on Months_on_book

Observations on Credit_Limit

Observations on Total_Revolving_Bal

Observations on Avg_Open_To_Buy

Observations on Total_Trans_Amt

Observations on Total_Trans_Ct

Observations on Total_Ct_Chng_Q4_Q1

Observations on Total_Amt_Chng_Q4_Q1

Observations on Avg_Utilization_Ratio

Observations on Attrition_Flag

Observations on Gender

Observations on Education_Level

Observations on Marital_Status

Observations on Income_Category

Observations on Card_Category

Observations on Total_Relationship_Count

Observations on Months_Inactive_12_mon

Observations on Contacts_Count_12_mon

Bivariate Analysis

As expected:

Attrited Customers (closed accounts) show some patterns, such as:

Let's define one more function to plot stacked bar charts
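A possible sketch of such a function (the function name, figure size, and default target column are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

def stacked_barplot(data, predictor, target="Attrition_Flag"):
    """Plot the share of each `target` level within each level of
    `predictor` as a normalized stacked bar chart."""
    crosstab = pd.crosstab(data[predictor], data[target], normalize="index")
    crosstab.plot(kind="bar", stacked=True, figsize=(8, 4))
    plt.ylabel("Proportion")
    plt.legend(loc="upper right")
    return crosstab
```

Normalizing by index makes each bar sum to 1, so differences in churn share across categories are directly comparable even when group sizes differ.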

Attrition_Flag vs Gender

Attrition_Flag vs Education_Level

Attrition_Flag vs Marital_Status

Attrition_Flag vs Income_Category

Attrition_Flag vs Card_Category

Attrition_Flag vs Dependent_count

Attrition_Flag vs Total_Relationship_Count

Attrition_Flag vs Months_Inactive_12_mon

Attrition_Flag vs Contacts_Count_12_mon

Multivariate Analysis

Total_Amt_Chng_Q4_Q1 vs Months_Inactive_12_mon per Attrition_Flag

Total_Trans_Amt vs Total_Revolving_Bal per Attrition_Flag

Total_Trans_Amt vs Total_Amt_Chng_Q4_Q1 per Attrition_Flag

Total_Trans_Amt vs Total_Relationship_Count per Attrition_Flag

Total_Revolving_Bal vs Total_Relationship_Count per Attrition_Flag

Total_Revolving_Bal vs Card_Category per Attrition_Flag

Total_Trans_Ct vs Income_Category per Attrition_Flag


Data Preparation for Modeling


Missing-Value Treatment
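A simple imputation sketch, assuming numeric gaps are filled with the median and categorical gaps with the mode (the actual strategy used in the notebook may differ; the stand-in frame is illustrative):

```python
import pandas as pd

# Stand-in frame with one missing value of each kind.
data = pd.DataFrame({
    "Credit_Limit": [12691.0, None, 3418.0],
    "Education_Level": ["Graduate", None, "Graduate"],
})

# Median for numeric columns, mode for categorical ones.
for col in data.columns:
    if pd.api.types.is_numeric_dtype(data[col]):
        data[col] = data[col].fillna(data[col].median())
    else:
        data[col] = data[col].fillna(data[col].mode()[0])

print(data.isna().sum().sum())  # 0 - no missing values remain
```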

Creating dummy variables for categorical variables
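With pandas this is typically done via `pd.get_dummies` (the column names below are illustrative):

```python
import pandas as pd

# Stand-in frame; in the notebook all categorical predictors are encoded.
data = pd.DataFrame({
    "Gender": ["M", "F", "M"],
    "Customer_Age": [45, 49, 51],
})

# drop_first=True drops the redundant reference level per column.
encoded = pd.get_dummies(data, columns=["Gender"], drop_first=True)
print(encoded.columns.tolist())  # ['Customer_Age', 'Gender_M']
```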

Building the model

Model evaluation criterion:

The model can make two kinds of wrong predictions:

  1. Predicting that the customer is going to churn when the customer doesn't churn - loss of resources
  2. Predicting that the customer is not going to churn when the customer churns - loss of opportunity

Which case is more important?

How do we reduce this loss, i.e., reduce false negatives?
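Since false negatives are churners we fail to flag, recall is the natural metric to maximize. A small illustration with hypothetical labels (1 = churn, 0 = stays):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 4 true churners, the model catches 3 of them.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("false negatives:", fn)                   # 1
print("recall:", recall_score(y_true, y_pred))  # 3/4 = 0.75
```

Optimizing for recall (rather than accuracy) during model selection directly pushes the false-negative count down.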

1. Model Building: Default parameters

Let's evaluate the model performance by using KFold and cross_val_score
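A sketch of the evaluation loop, using synthetic data and a stand-in estimator in place of the bank data and the candidate models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier  # stand-in estimator
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, imbalanced stand-in data (~16% positive class, as an illustration).
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.84], random_state=1)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
model = RandomForestClassifier(random_state=1)

# Recall per fold; mean/std summarize expected generalization.
scores = cross_val_score(model, X, y, cv=kfold, scoring="recall")
print(scores.mean(), scores.std())
```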

2. Model building - Oversampled data
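Oversampling is often done with SMOTE from imbalanced-learn; the sketch below shows dependency-free random oversampling with `sklearn.utils.resample` instead (synthetic data; the class ratio is illustrative):

```python
import numpy as np
from sklearn.utils import resample

# Synthetic stand-in data: 84 non-churners, 16 churners.
rng = np.random.RandomState(1)
X = rng.rand(100, 3)
y = np.array([0] * 84 + [1] * 16)

X_maj, y_maj = X[y == 0], y[y == 0]
X_min, y_min = X[y == 1], y[y == 1]

# Duplicate minority rows (with replacement) up to the majority size.
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=1)
X_over = np.vstack([X_maj, X_min_up])
y_over = np.concatenate([y_maj, y_min_up])
print(np.bincount(y_over))  # [84 84]
```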

3. Model building - Undersampled data
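The mirror-image sketch for random undersampling, dropping majority-class rows down to the minority count (synthetic data again):

```python
import numpy as np
from sklearn.utils import resample

# Same synthetic stand-in data as the oversampling sketch.
rng = np.random.RandomState(1)
X = rng.rand(100, 3)
y = np.array([0] * 84 + [1] * 16)

# Sample majority rows without replacement down to the minority size.
X_maj_down, y_maj_down = resample(X[y == 0], y[y == 0], replace=False,
                                  n_samples=int((y == 1).sum()),
                                  random_state=1)
X_under = np.vstack([X_maj_down, X[y == 1]])
y_under = np.concatenate([y_maj_down, y[y == 1]])
print(np.bincount(y_under))  # [16 16]
```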

The best cross-validation performance was obtained with oversampling; on the other hand, undersampling shows more consistency between cross-validation and the validation set. Considering that with oversampling the validation scores fell outside the cross-validation range and were lower, we will use the undersampled dataset, where the validation scores lie within the cross-validation range; hence it is the best approach for us.

4. Choosing 3 models to tune

Our top 3 models by expected performance on unseen data, considering cross-validation and performance on the validation set, were trained on the UNDERSAMPLED dataset:

  1. XGBoost: Performs better than all other models; the boxplot shows consistent performance with a mean of 0.955, a range between 0.94 and 0.965, and no outliers.
  2. GBM: The boxplot shows consistent performance with 2 outliers, a mean of 0.94, and a maximum around 0.965.
  3. AdaBoost: This model shows consistency and no outliers, with a range between 0.91 and 0.945 and a mean of 0.93.

We will tune the XGBoost, GBM, and AdaBoost models using RandomizedSearchCV.
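A sketch of one such search, shown for AdaBoost (the XGBoost and GBM searches follow the same pattern with their own parameter grids; grid values and synthetic data are illustrative):

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data for the undersampled training set.
X, y = make_classification(n_samples=200, random_state=1)

# Illustrative search space; RandomizedSearchCV samples n_iter combinations.
param_dist = {
    "n_estimators": randint(50, 200),
    "learning_rate": uniform(0.01, 1.0),
}
search = RandomizedSearchCV(AdaBoostClassifier(random_state=1), param_dist,
                            n_iter=5, cv=3, scoring="recall", random_state=1)
search.fit(X, y)
print(search.best_params_)
```

Sampling a fixed number of candidates makes the search cost predictable, unlike an exhaustive grid search over the same space.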

AdaBoost

AdaBoost Default

Gradient Boosting

GB Default parameters

XGBoost

XGB Default

6. Model Performances

Performance on the test set


Pipelines for productionizing the model
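A minimal sketch of such a pipeline, chaining an imputer with a stand-in final model (the actual tuned estimator and preprocessing steps may differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in final model
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data for the bank dataset.
X, y = make_classification(n_samples=200, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# The pipeline guarantees the same preprocessing runs at fit and predict time.
pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("model", GradientBoostingClassifier(random_state=1)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

Bundling the steps this way means the fitted pipeline can be serialized and deployed as a single object, avoiding train/serve preprocessing mismatches.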


Note:


Insights & Recommendations